Temporal-difference (TD) learning represents a paradigm shift in reinforcement learning. It bridges the gap between Monte Carlo methods, which learn from raw sampled experience, and Dynamic Programming, which updates one estimate from another. At its core, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
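Concretely, the simplest TD method, TD(0), shifts its estimate of the current state's value toward the reward just received plus the discounted value estimate of the next state:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]$$

The bracketed term is the TD error: the target $R_{t+1} + \gamma V(S_{t+1})$ leans on the learned estimate $V(S_{t+1})$ rather than the final return, which is exactly what "bootstrap" means here.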
The Driving Home Analogy
Imagine you are driving home. In a Monte Carlo world, you only update your belief about your commute time once you step through your front door. If you hit a massive traffic jam 10 minutes in, you just sit there, unable to 'learn' until the journey ends. In the TD Learning world, the moment you see those brake lights, you immediately adjust your estimate of your total travel time. You don't need the final outcome to know your initial prediction was wrong.
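To make the contrast concrete, here is a minimal Python sketch of the commute example. The waypoint names, travel times, and initial estimates are all made up for illustration; the point is only that the Monte Carlo update has to wait for the full trip, while the TD(0) update fires at every waypoint.

```python
alpha, gamma = 0.1, 1.0  # step size; no discounting for a finite commute

# Current estimates of remaining travel time (minutes) from each waypoint.
V = {"leave_office": 30.0, "highway": 20.0, "exit_ramp": 10.0, "home": 0.0}

# One observed trip: (waypoint, minutes spent before reaching the next one).
trip = [("leave_office", 5), ("highway", 25), ("exit_ramp", 12)]
waypoints = [s for s, _ in trip] + ["home"]

# --- Monte Carlo: no updates until the trip is over ---
V_mc = dict(V)
elapsed_before, elapsed = {}, 0
for state, minutes in trip:
    elapsed_before[state] = elapsed
    elapsed += minutes
total_trip = elapsed
for state, before in elapsed_before.items():
    observed_remaining = total_trip - before          # the actual return
    V_mc[state] += alpha * (observed_remaining - V_mc[state])

# --- TD(0): update the moment the next waypoint is reached ---
V_td = dict(V)
for (state, minutes), next_state in zip(trip, waypoints[1:]):
    target = minutes + gamma * V_td[next_state]       # bootstrap from the next estimate
    V_td[state] += alpha * (target - V_td[state])

print("Monte Carlo:", V_mc)
print("TD(0):     ", V_td)
```

Both end up with similar estimates after enough trips, but only the TD loop could have run mid-journey, the moment the brake lights appeared.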
The Bucket Brigade Intuition
Think of one-step, tabular, model-free TD methods as a bucket brigade. Instead of one person running back and forth between the fire and the well, a line of people passes buckets of information back along the chain. As soon as State B is reached, its estimated value is used to correct State A's value. This incremental nature often speeds up convergence in practice and enables learning in continuing tasks that have no natural end. A minimal sketch of the idea follows.
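As a concrete instance of that bucket brigade, here is a tabular TD(0) prediction sketch in Python. The environment, a small random-walk chain with a reward of 1 for exiting on the right, is a made-up illustrative example; the update inside the loop is the standard one-step TD(0) rule.

```python
import random

alpha, gamma = 0.1, 1.0
n_states = 5                       # states 0..4; stepping off either end terminates
V = [0.0] * n_states               # tabular value estimates

def step(state):
    """Move left or right at random; return (next_state, reward, done)."""
    nxt = state + random.choice([-1, 1])
    if nxt < 0:
        return None, 0.0, True      # fell off the left end
    if nxt >= n_states:
        return None, 1.0, True      # exited on the right: reward 1
    return nxt, 0.0, False

for episode in range(5000):
    state = n_states // 2           # start in the middle
    done = False
    while not done:
        next_state, reward, done = step(state)
        # Bootstrapped target: reward plus the current estimate of the next state.
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])   # the bucket-brigade correction
        state = next_state

print([round(v, 2) for v in V])     # drifts toward [1/6, 2/6, 3/6, 4/6, 5/6]
```

Each backup passes one "bucket" of value one step back along the chain, and because the update never waits for an episode to end, the same loop works unchanged on continuing tasks.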